The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
https://arxiv.org/abs/2403.17031